Abstract
Neural networks need the right representations of input data to learn. Here we ask how gradient-based learning shapes a fundamental property of representations in recurrent neural networks (RNNs)—their dimensionality. Through simulations and mathematical analysis, we show how gradient descent can lead RNNs to compress the dimensionality of their representations in a way that matches task demands during training while supporting generalization to unseen examples. This can require an expansion of dimensionality in early timesteps and compression in later ones, and strongly chaotic RNNs appear particularly adept at learning this balance. Beyond helping to elucidate the power of appropriately initialized artificial RNNs, this fact has implications for neurobiology as well. Neural circuits in the brain reveal both high variability associated with chaos and low-dimensional dynamical structures. Taken together, our findings show how simple gradient-based learning rules lead neural networks to solve tasks with robust representations that generalize to new cases.
Data availability
All data used in the paper are generated by the code at ref. 61.
Code availability
Code for training the networks and generating the plots can be found in a Code Ocean capsule (ref. 61).
Change history
19 October 2022
A Correction to this paper has been published: https://doi.org/10.1038/s42256-022-00565-6
References
Cover, T. M. Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Trans. Electron. Comput. EC-14, 326–334 (1965).
Fusi, S., Miller, E. K. & Rigotti, M. Why neurons mix: high dimensionality for higher cognition. Curr. Opin. Neurobiol. 37, 66–74 (2016).
Vapnik, V. N. Statistical Learning Theory (Wiley-Interscience, 1998).
Litwin-Kumar, A., Harris, K. D., Axel, R., Sompolinsky, H. & Abbott, L. F. Optimal degrees of synaptic connectivity. Neuron 93, 1153–1164 (2017).
Cayco-Gajic, N. A., Clopath, C. & Silver, R. A. Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nat. Commun. 8, 1116 (2017).
Wallace, C. S. & Boulton, D. M. An information measure for classification. Comput. J. 11, 185–194 (1968).
Rissanen, J. Modeling by shortest data description. Automatica 14, 465–471 (1978).
Bengio, Y., Courville, A. & Vincent, P. Representation learning: a review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 35, 1798–1828 (2013).
Ansuini, A., Laio, A., Macke, J. H. & Zoccolan, D. Intrinsic dimension of data representations in deep neural networks. Adv. Neural Inf. Process. Syst. 32, 11 (2019).
Recanatesi, S. et al. Dimensionality compression and expansion in deep neural networks. Preprint at https://arxiv.org/abs/1906.00443 (2019).
Cohen, U., Chung, S. Y., Lee, D. D. & Sompolinsky, H. Separability and geometry of object manifolds in deep neural networks. Nat. Commun. 11, 746 (2020).
Jaeger, H. The ‘Echo State’ Approach to Analysing and Training Recurrent Neural Networks—with an Erratum Note. GMD Technical Report 148 (German National Research Center for Information Technology, 2001).
Maass, W., Natschläger, T. & Markram, H. Real-time computing without stable states: a new framework for neural computation based on perturbations. Neural Comput. 14, 2531–2560 (2002).
Legenstein, R. & Maass, W. Edge of chaos and prediction of computational performance for neural circuit models. Neural Netw. 20, 323–334 (2007).
Keup, C., Kühn, T., Dahmen, D. & Helias, M. Transient chaotic dimensionality expansion by recurrent networks. Phys. Rev. X 11, 021064 (2021).
Vreeswijk, C. V. & Sompolinsky, H. Chaotic balanced state in a model of cortical circuits. Neural Comput. 10, 1321–1371 (1998).
Litwin-Kumar, A. & Doiron, B. Slow dynamics and high variability in balanced cortical networks with clustered connections. Nat. Neurosci. 15, 1498–1505 (2012).
Wolf, F., Engelken, R., Puelma-Touzel, M., Weidinger, J. D. F. & Neef, A. Dynamical models of cortical circuits. Curr. Opin. Neurobiol. 25, 228–236 (2014).
Lajoie, G., Lin, K. & Shea-Brown, E. Chaos and reliability in balanced spiking networks with temporal drive. Phys. Rev. E 87, 2432–2437 (2013).
London, M., Roth, A., Beeren, L., Häusser, M. & Latham, P. E. Sensitivity to perturbations in vivo implies high noise and suggests rate coding in cortex. Nature 466, 123–127 (2010).
Stam, C. J. Nonlinear dynamical analysis of EEG and MEG: review of an emerging field. Clin. Neurophysiol. 116, 2266–2301 (2005).
Engelken, R. & Wolf, F. Dimensionality and entropy of spontaneous and evoked rate activity. In APS March Meeting Abstracts, Bull. Am. Phys. Soc. eP5.007 (2017).
Kaplan, J. L. & Yorke, J. A. in Functional Differential Equations and Approximations of Fixed Points: Proceedings, Bonn, July 1978 204–227 (Springer, 1979).
Sussillo, D. & Abbott, L. F. Generating coherent patterns of activity from chaotic neural networks. Neuron 63, 544–557 (2009).
DePasquale, B., Cueva, C. J., Rajan, K., Escola, G. S. & Abbott, L. F. full-FORCE: A target-based method for training recurrent networks. PLoS ONE 13, e0191527 (2018).
Stern, M., Olsen, S., Shea-Brown, E., Oganian, Y. & Manavi, S. In the footsteps of learning: changes in network dynamics and dimensionality with task acquisition. In Proc. COSYNE 2018, abstract no. III-100.
Farrell, M. Revealing Structure in Trained Neural Networks Through Dimensionality-Based Methods. PhD thesis, Univ. Washington (2020).
Rajan, K., Abbott, L. F. & Sompolinsky, H. Stimulus-dependent suppression of chaos in recurrent neural networks. Phys. Rev. E 82, 011903 (2010).
Bell, R. J. & Dean, P. Atomic vibrations in vitreous silica. Discuss. Faraday Soc. 50, 55–61 (1970).
Gao, P., Trautmann, E., Yu, B. & Santhanam, G. A theory of multineuronal dimensionality, dynamics and measurement. Preprint at bioRxiv https://doi.org/10.1101/214262 (2017).
Riesenhuber, M. & Poggio, T. Hierarchical models of object recognition in cortex. Nat. Neurosci. 2, 1019–1025 (1999).
Goodfellow, I., Lee, H., Le, Q. V., Saxe, A. & Ng, A. Y. Measuring invariances in deep networks. Adv. Neural Inf. Process. Syst. 22, 646–654 (2009).
Lajoie, G., Lin, K. K., Thivierge, J.-P. & Shea-Brown, E. Encoding in balanced networks: revisiting spike patterns and chaos in stimulus-driven systems. PLoS Comput. Biol. 12, e1005258 (2016).
Huang, H. Mechanisms of dimensionality reduction and decorrelation in deep neural networks. Phys. Rev. E 98, 062313–062322 (2018).
Kadmon, J. & Sompolinsky, H. Optimal architectures in a solvable model of deep networks. Adv. Neural Inf. Process. Syst. 29, 4781–4789 (2016).
Papyan, V., Han, X. Y. & Donoho, D. L. Prevalence of neural collapse during the terminal phase of deep learning training. Proc. Natl Acad. Sci. USA 117, 24652–24663 (2020).
Shwartz-Ziv, R. & Tishby, N. Opening the black box of deep neural networks via information. Preprint at https://arxiv.org/abs/1703.00810 (2017).
Shwartz-Ziv, R., Painsky, A. & Tishby, N. Representation compression and generalization in deep neural networks. Preprint at OpenReview (2019).
Babadi, B. & Sompolinsky, H. Sparseness and expansion in sensory representations. Neuron 83, 1213–1226 (2014).
Marr, D. A theory of cerebellar cortex. J. Physiol. 202, 437–470.1 (1969).
Albus, J. S. A theory of cerebellar function. Math. Biosci. 10, 25–61 (1971).
Stringer, C., Pachitariu, M., Steinmetz, N., Carandini, M. & Harris, K. D. High-dimensional geometry of population responses in visual cortex. Nature 571, 361–365 (2019).
Mazzucato, L., Fontanini, A. & LaCamera, G. Stimuli reduce the dimensionality of cortical activity. Front. Syst. Neurosci. 10, 11 (2016).
Rosenbaum, R., Smith, M. A., Kohn, A., Rubin, J. E. & Doiron, B. The spatial structure of correlated neuronal variability. Nat. Neurosci. 20, 107–114 (2017).
Landau, I. D. & Sompolinsky, H. Coherent chaos in a recurrent neural network with structured connectivity. PLoS Comput. Biol. 14, e1006309 (2018).
Huang, C. et al. Circuit models of low-dimensional shared variability in cortical networks. Neuron 101, 337–348.e4 (2019).
Mastrogiuseppe, F. & Ostojic, S. Linking connectivity, dynamics, and computations in low-rank recurrent neural networks. Neuron 99, 609–623.e29 (2018).
Mazzucato, L., Fontanini, A. & La Camera, G. Dynamics of multistable states during ongoing and evoked cortical activity. J. Neurosci. 35, 8214–8231 (2015).
Cunningham, J. P. & Yu, B. M. Dimensionality reduction for large-scale neural recordings. Nat. Neurosci. 17, 1500–1509 (2014).
Goodfellow, I., Bengio, Y. & Courville, A. Deep Learning (MIT Press, 2016); http://www.deeplearningbook.org
Faisal, A. A., Selen, L. P. J. & Wolpert, D. M. Noise in the nervous system. Nat. Rev. Neurosci. 9, 292–303 (2008).
Freedman, D. J. & Assad, J. A. Experience-dependent representation of visual categories in parietal cortex. Nature 443, 85–88 (2006).
Dangi, S., Orsborn, A. L., Moorman, H. G. & Carmena, J. M. Design and analysis of closed-loop decoder adaptation algorithms for brain–machine interfaces. Neural Comput. 25, 1693–1731 (2013).
Orsborn, A. L. & Pesaran, B. Parsing learning in networks using brain–machine interfaces. Curr. Opin. Neurobiol. 46, 76–83 (2017).
Recanatesi, S. et al. Predictive learning as a network mechanism for extracting low-dimensional latent space representations. Nat. Commun. 12, 1417 (2021).
Banino, A. et al. Vector-based navigation using grid-like representations in artificial agents. Nature 557, 429 (2018).
Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M. & Tang, P. T. P. On large-batch training for deep learning: generalization gap and sharp minima. In 5th International Conference on Learning Representations https://doi.org/10.48550/arXiv.1609.04836 (2017).
Advani, M. S., Saxe, A. M. & Sompolinsky, H. High-dimensional dynamics of generalization error in neural networks. Neural Netw. 132, 428–446 (2020).
Li, Y. & Liang, Y. Learning overparameterized neural networks via stochastic gradient descent on structured data. Adv. Neural Inf. Process. Syst. 31 (2018).
Lipton, Z. C., Berkowitz, J. & Elkan, C. A critical review of recurrent neural networks for sequence learning. Preprint at https://arxiv.org/abs/1506.00019 (2015).
Farrell, M. Gradient-based learning drives robust representations in RNNs by balancing compression and expansion. Code Ocean https://doi.org/10.24433/CO.5101546.v1 (2022).
Acknowledgements
M.F. was funded by the National Science Foundation Graduate Research Fellowship under Grant DGE-1256082. G.L. is funded by an NSERC Discovery Grant (RGPIN-2018-04821), an FRQNT Young Investigator Startup Program (2019-NC-253251) and an FRQS Research Scholar Award, Junior 1 (LAJGU0401-253188). E.S.-B. acknowledges the support of NSF DMS Grant 1514743. M.F. thanks the Swartz Program in Theoretical Neuroscience at Harvard and S.R. thanks the Swartz Center for Theoretical Neuroscience at the University of Washington for support. We thank M. Stern, D. Chklovskii, A. Weber, N. Steinmetz and L. Mazzucato for their insights and suggestions. M.F. would also like to thank H. Sompolinsky and S. Chung for their mentorship and inspiration.
Author information
Contributions
M.F., S.R. and E.S.-B. conceived the study. M.F. wrote code and ran simulations with some guidance from S.R. The manuscript was primarily written by M.F., with substantial edits and contributions made by S.R., G.L. and E.S.-B. G.L. contributed code for computing Lyapunov exponents and provided additional insight. T.M. ran the simulations for Extended Data Fig. 9 and ran additional verification experiments for intermediate values of β.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Machine Intelligence thanks Cristina Savin and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Effects of changing the evaluation timestep and number of recurrent units.
Strongly chaotic (red) and edge-of-chaos (cyan) networks are trained to classify high-dimensional inputs. Details are as in Fig. 2e. Shaded regions are as defined in Fig. 2e. First row: network trained with a categorical cross-entropy loss with a learning rate of 1e-4. Second row: network trained with a mean squared error loss with a learning rate of 1e-3. First column: evaluation time is t = 6. Second column: evaluation time is t = 10. Third column: evaluation time is t = 14. Fourth column: Number of hidden neurons is increased to N = 300. Evaluation time is t = 14.
Extended Data Fig. 2 Effects of changing the evaluation timestep, input dimension, and number of recurrent units.
Strongly chaotic (red) and edge-of-chaos (cyan) networks are trained to classify low-dimensional inputs. Dashed lines depict the network before training, and solid is after training. All networks are trained with a categorical cross-entropy loss and a learning rate of 1e-4 (note that this is a factor of 10 less than used in the main text). Shaded regions are as defined in Fig. 2e. Other details are as in Fig. 3e. First row: 2-dimensional inputs. Second row: 4-dimensional inputs. Third row: 10-dimensional inputs. First column: evaluation time is t = 6. Second column: evaluation time is t = 10. Third column: evaluation time is t = 14. Fourth column: Number of hidden neurons is increased to N = 300. Evaluation time is t = 14.
Extended Data Fig. 3 Effects of changing the evaluation timestep, input dimension, and number of recurrent units on logistic regression testing accuracy.
Strongly chaotic (red) and edge-of-chaos (cyan) networks are trained to classify low-dimensional inputs. Dashed lines depict the network before training, and solid is after training. Details are as in Extended Data Fig. 2, but this time measuring the logistic regression testing accuracy as defined in the main text.
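The logistic-regression testing accuracy reported here can be reproduced in sketch form by fitting a linear readout to the network's hidden responses. A minimal binary NumPy version is below; the function name, optimizer settings and data shapes are illustrative assumptions, not the paper's exact (multiclass) procedure:

```python
import numpy as np

def decoder_accuracy(H_train, y_train, H_test, y_test, lr=0.1, steps=2000):
    """Binary logistic-regression decoder trained on network responses.

    Fits weights w and bias c by full-batch gradient descent on the
    logistic loss, then returns classification accuracy on held-out
    responses. Hyperparameters here are illustrative choices.
    """
    w = np.zeros(H_train.shape[1])
    c = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(H_train @ w + c)))  # predicted P(y = 1 | h)
        w -= lr * H_train.T @ (p - y_train) / len(y_train)
        c -= lr * (p - y_train).mean()
    pred = (H_test @ w + c) > 0
    return (pred == y_test).mean()
```

Because the decoder is linear, its testing accuracy probes how linearly separable the class representations are at the chosen evaluation timestep.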
Extended Data Fig. 4 Effects of changing the evaluation timestep, input dimension, and number of recurrent units with 120 input clusters.
Strongly chaotic (red) and edge-of-chaos (cyan) networks are trained to classify low-dimensional inputs. Dashed lines depict the network before training, and solid is after training. Details are as in Extended Data Fig. 2, but here 120 input clusters are used instead of 60.
Extended Data Fig. 5 Between-class distances are increased while within-class distances are diminished by the network dynamics.
Mean pairwise distance between points belonging to the same class (dashed lines), mean pairwise distance between points belonging to different classes (dotted lines), and the ratio of the first to the second (blue lines and axes), for the representations of trained networks over time t. Details are as in Fig. 2e in the main text. Shaded regions are as defined in Fig. 2e. a. Edge-of-chaos network as defined in the main text. b. Strongly chaotic network as defined in the main text.
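The quantities in this panel can be computed directly from the network representations. A sketch, assuming the representation is an array of points (samples × units) with integer class labels; the function name and shapes are illustrative:

```python
import numpy as np

def within_between_ratio(X, labels):
    """Ratio of mean within-class to mean between-class pairwise distance.

    X: array of shape (n_samples, n_units); labels: integer class labels.
    A small ratio means classes form tight, well-separated clusters.
    """
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(-1))          # full pairwise distance matrix
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(X), dtype=bool)
    within = D[same & off_diag].mean()        # same class, excluding self-pairs
    between = D[~same].mean()                 # different classes
    return within / between
```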
Extended Data Fig. 6 Dependence of dimensionality on the learning rate.
Here we reproduce the results of Figs. 2e and 3e of the main text, using a different learning rate. Red lines correspond to strongly chaotic and cyan lines to edge-of-chaos networks. Dashed and solid lines depict before and after training, respectively. Shaded regions are as defined in Fig. 2e. Top row: high-dimensional inputs as in Fig. 2e. Bottom row: low-dimensional inputs as in Fig. 3e. Left column: learning rate of 1e-4. Right column: learning rate of 1e-3, as in the main text.
Extended Data Fig. 7 Dimensionality increases with number of class labels, but not number of clusters.
Effective dimensionalities (EDs) of the trained network responses to inputs embedded in an N-dimensional space, measured at the evaluation time teval = 10. Error bars denote two standard deviations of three initializations of task and networks (in all panels they are too small to see). Details are similar to Fig. 2. a. Edge-of-chaos networks. Blue: ED of the inputs. Green: ED of the network representation as a function of the number of input clusters. Dimensionality remains flat and small. b. Edge-of-chaos networks. Green: ED of the network representation as a function of the number of class labels. Black: Effective dimensionality of points distributed uniformly at random in an N-dimensional ball. The number of points drawn is determined by the number of class labels. This is to roughly measure what the ED of the network would be if it formed a fixed point for every class label, and distributed these fixed points randomly in space. c. Strongly chaotic networks. Legend as in a. d. Strongly chaotic networks. Legend as in b.
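Effective dimensionality throughout these panels can be computed as the participation ratio of the covariance eigenvalues (in the spirit of Gao et al., ref. 55); this sketch assumes that definition:

```python
import numpy as np

def effective_dimensionality(X):
    """Participation ratio of the covariance spectrum of X (samples x units).

    ED = (sum_i lambda_i)^2 / (sum_i lambda_i^2), where lambda_i are the
    eigenvalues of the covariance matrix. Isotropic data in d dimensions
    gives ED close to d; data on a line gives ED close to 1.
    """
    X = X - X.mean(axis=0)                    # centre the responses
    cov = X.T @ X / (X.shape[0] - 1)          # covariance across samples
    eig = np.linalg.eigvalsh(cov)
    eig = np.clip(eig, 0.0, None)             # guard against tiny negative eigenvalues
    return eig.sum() ** 2 / (eig ** 2).sum()
```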
Extended Data Fig. 8 Example of noise in output weights driving compression of the hidden representation in a linear network with two hidden layer units.
The equation for the network is h = Wx + b with output \(\hat{o}={\boldsymbol{r}}^{T}{\boldsymbol{h}}\). The input weights (red) are initialized to the 2 × 2 identity matrix, and the bias is initialized as (1, 0). The inputs are placed on a grid from x = −1 to x = 2 and from y = −3 to y = 3 (not shown). Network output \(\hat{o}\) is trained to minimize the squared error loss \(0.5{(\hat{o}-1)}^{2}\). Input samples are chosen randomly, and input weights are updated via stochastic gradient descent with batch size 1. a. Top: Diagram of network where input weights are trained and output weights are fixed. Bottom: Diagram of network where input weights are trained and output weights are drawn from a normal distribution with mean (1, 0) and covariance 0.05I at every update step. In the figure, η represents additive white noise. Middle: hidden unit responses (blue circles) to the inputs before training (iteration 0). Black dot denotes the output weight vector, and the blue line is the affine subspace of points that r maps to 1. b. Evolution of the hidden layer response to inputs (representation) as input weights are trained. Top: Representation of the network where output weights are fixed. The iteration number denotes the number of training samples that have been used to update the weights. Activations compress to the space orthogonal to r, shifted by (1, 0). Bottom: Representation of the network where output weights are randomly drawn at every input sample presentation. Activations compress to a compact, localized space. The direction of compression is both along and orthogonal to r.
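The compression mechanism in this caption can be illustrated in a few lines of NumPy. The sketch below follows the caption's setup (h = Wx + b, output r·h, squared error loss, readout r resampled around (1, 0) each step); the learning rate, grid resolution and iteration count are illustrative choices, not the paper's exact values:

```python
import numpy as np

rng = np.random.default_rng(0)
W = np.eye(2)                       # input weights, initialized to identity
b = np.array([1.0, 0.0])            # bias, initialized to (1, 0)
# Input grid spanning x in [-1, 2] and y in [-3, 3]
xs = np.stack(np.meshgrid(np.linspace(-1, 2, 7),
                          np.linspace(-3, 3, 7)), -1).reshape(-1, 2)

def spread(W, b):
    """Total variance of the hidden responses over the input grid."""
    h = xs @ W.T + b
    return h.var(axis=0).sum()

before = spread(W, b)
lr = 0.05
for _ in range(20000):
    x = xs[rng.integers(len(xs))]                               # batch size 1
    r = rng.multivariate_normal([1.0, 0.0], 0.05 * np.eye(2))   # noisy readout
    h = W @ x + b
    err = r @ h - 1.0               # dL/dh = err * r for L = 0.5 * (r.h - 1)^2
    W -= lr * np.outer(err * r, x)  # chain rule: dL/dW = err * r x^T
    b -= lr * err * r
after = spread(W, b)
```

Running this drives the spread of the hidden responses far below its initial value: with the readout resampled each step, the expected loss penalizes the magnitude of h itself, so all inputs are mapped toward a single compact cluster, matching the bottom row of panel b.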
Extended Data Fig. 9 Representations of RNNs trained on the MNIST digit recognition dataset.
a. Effective dimensionality (ED) of RNNs trained on the MNIST digit recognition dataset. The ED of the network’s responses to test inputs is plotted. After training, dimensionality compresses down to a value by t = 10 that roughly matches the number of class labels (10). This compression is similar to that seen in Fig. 2 of the main text. Details are as in Fig. 2e of the main text. Shaded regions are as defined in Fig. 2e. b. Projection onto the top three principal components of MNIST test data. Colours indicate true class label (i.e., digit identity). c. Projection onto the top three principal components of the edge-of-chaos recurrent network’s responses to the inputs in b after training, at the evaluation time t = 10. Colours indicate true class label as in b. The network forms a localized cluster for each digit.
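Projections such as those in panels b and c are obtained by principal component analysis of the response matrix; a minimal sketch (function name is illustrative):

```python
import numpy as np

def top_pc_projection(X, k=3):
    """Project the rows of X (samples x units) onto the top-k principal components."""
    Xc = X - X.mean(axis=0)
    # SVD of the centred data: rows of Vt are the principal directions
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T
```

Plotting the three returned coordinates, coloured by digit label, reproduces the kind of per-class clustering shown in panel c.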
Extended Data Fig. 10 Effects of changing the initial coupling strength β.
Details are as in Fig. 3e, except that here we vary the coupling parameter β, whose value is indicated by the colourbar to the right. Shaded regions are as defined in Fig. 2e. First column: ED of networks before training. Second column: ED of networks after training with a learning rate of 1e-3. Third column: ED of networks after training with a learning rate of 1e-4.
Supplementary information
Supplementary Information
Supplementary Appendix.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Farrell, M., Recanatesi, S., Moore, T. et al. Gradient-based learning drives robust representations in recurrent neural networks by balancing compression and expansion. Nat Mach Intell 4, 564–573 (2022). https://doi.org/10.1038/s42256-022-00498-0